Non-linear dimension reduction (NLDR) methods provide a low-dimensional representation of high-dimensional data (\(p\text{-}D\)) by applying a non-linear transformation. However, the complexity of the transformations and data structures can create wildly different representations depending on the method and hyper-parameter choices. It is difficult to determine whether any of these representations are accurate, which one is the best, or whether they have missed important structures. The R package quollr has been developed as a new visual tool to determine which method and which hyper-parameter choices provide the most accurate representation of high-dimensional data. The scurve data from the package is used to illustrate the algorithm. Single-cell RNA sequencing (scRNA-seq) data from mouse limb muscles are used to demonstrate the usability of the package.
Non-linear dimension reduction (NLDR) techniques, such as t-distributed stochastic neighbor embedding (tSNE) (Maaten and Hinton 2008), uniform manifold approximation and projection (UMAP) (McInnes et al. 2018), potential of heat-diffusion for affinity-based trajectory embedding (PHATE) algorithm (Moon et al. 2019), large-scale dimensionality reduction Using triplets (TriMAP) (Amid and Warmuth 2019), and pairwise controlled manifold approximation (PaCMAP) (Wang et al. 2021), create wildly different representations depending on the selected method and hyper-parameter choices. It is difficult to determine whether any of these representations are accurate, which one is the best, or whether they have missed important structures.
This paper presents the R package, quollr, which is useful for understanding how NLDR warps high-dimensional space and fits the data. Starting with an NLDR layout, our approach is to create a wireframe model representation, that can be lifted and displayed in the high-dimensional space (Figure 1).
Figure 1: algorithm
The paper is organized as follows. The next section introduces the implementation of the quollr package on CRAN, including a demonstration of the package’s key functions and visualization capabilities. In the application section, we illustrate the algorithm’s functionality for studying a clustering data structure. Finally, we conclude the paper with a brief summary and discuss potential opportunities for using our algorithm.
Our algorithm includes the following steps: (1) scaling the NLDR data, (2) computing configurations of a hexagon grid, (3) binning the data, (4) obtaining the centroids of each bin, (5) indicating neighboring bins with line segments that connect the centroids, and (6) lifting the model into high dimensions (Figure 2). A detailed description of the algorithm can be found in cite the methodology paper.
Figure 2: Key steps for constructing the model on the UMAP layout: (a) NLDR data, (b) hexagon bins, (c) bin centroids, (d) triangulated centroids, and (e) lifting the model into high dimensions. The `Scurve` data is shown.
The development version can be installed from GitHub:
devtools::install_github("JayaniLakshika/quollr")The following demonstration of the package’s functionality assumes quollr has been loaded. We also want to load the built-in data sets scurve and scurve_umap.
scurve is a \(7\text{-}D\) simulated dataset. It is constructed by simulating \(5000\) observations from \(\theta \sim U(-3\pi/2, 3\pi/2)\), \(X_1 = \sin(\theta)\), \(X_2 \sim U(0, 2)\) (adding thickness to the S), \(X_3 = \text{sign}(\theta) \times (\cos(\theta) - 1)\). The remaining variables \(X_4\), \(X_5\), \(X_6\), \(X_7\) are all uniform error, with small variance. scurve_umap is the UMAP \(2\text{-}D\) embedding for scurve data with n_neighbors is \(46\) and min_dist is \(0.9\). Each data set contains a unique ID column that maps scurve and scurve_umap.
The mains steps for the algorithm can be executed by the main function fit_highd_model(), or can be run separately for more flexibility.
This function requires several parameters: the high-dimensional data (highd_data), the emdedding data (nldr_data), the number of bins along the x-axis (b1), the buffer amount as a proportion of data (q), and benchmark value to extract high density hexagons (benchmark_highdens). The function returns an object that includes the scaled NLDR object (nldr_obj), the hexagonal object (hb_obj), the fitted model in both \(2\text{-}D\) (model_2d), and \(p\text{-}D\) (model_highd), and triangular mesh (trimesh_data).
fit_highd_model(
highd_data = scurve,
nldr_data = scurve_umap,
b1 = 15,
q = 0.1,
benchmark_highdens = 5)
Constructing the \(2\text{-}D\) model primarily involves (i) scaling the NLDR data, (ii) binning the data, (iii) obtaining bin centroids, (iv) connecting centroids with line segments to indicate neighbors, and (v) Remove low-density hexagons.
The algorithm starts by scaling the NLDR data to a standard range using the gen_scaled_data() function. This function standardizes the data so that the first embedding ranges from \(0\) to \(1\), while the second embedding scales from \(0\) to the maximum value of the second embedding. The output includes the scaled NLDR data along with the original limits of the embeddings.
scurve_umap_obj <- gen_scaled_data(nldr_data = scurve_umap)
scurve_umap_obj
> $scaled_nldr
> # A tibble: 5,000 × 3
> emb1 emb2 ID
> <dbl> <dbl> <int>
> 1 0.707 0.839 1
> 2 0.231 0.401 2
> 3 0.232 0.215 3
> 4 0.790 0.564 4
> 5 0.761 0.551 5
> 6 0.445 0.721 6
> 7 0.900 0.137 7
> 8 0.247 0.392 8
> 9 0.325 0.542 9
> 10 0.278 0.231 10
> # ℹ 4,990 more rows
>
> $lim1
> [1] -14.42166 13.32655
>
> $lim2
> [1] -12.43687 12.32455
The configurations of a hexagonal grid are determined by the number of bins and the bin width in each direction. The function calc_bins_y() is used for this purpose. This function accepts an object containing scaled NLDR data in the first and second columns, along with numeric vectors that represent the limits of the original NLDR data, the number of bins along the x-axis (b1), and the buffer amount as a proportion (q).
bin_configs <- calc_bins_y(
nldr_obj = scurve_umap_obj,
b1 = 15,
q = 0.1)
bin_configs
> $b2
> [1] 16
>
> $a1
> [1] 0.08326136
>
> $a2
> [1] 0.07210645
Points are allocated to bins based on the nearest centroid. The hexagonal binning algorithm can be executed using the hex_binning() function, or its components can be run separately for added flexibility. The parameters used within hex_binning() include an object containing scaled NLDR data in the first and second columns, along with numeric vectors that represent the limits of the original NLDR data (nldr_obj), the number of bins along the x-axis (b1), and the buffer amount as a proportion of the data (q). The output is an object of the hex_bin_obj class, which contains the bin widths in each direction (a1, a2), the number of bins in each direction (bins), the coordinates of the hexagonal grid starting point (start_point), the details of bin centroids (centroids), the coordinates of bins (hex_poly), NLDR components with their corresponding hexagon IDs (data_hb_id), hex bins with their corresponding standardized counts (std_cts), the total number of bins (tot_bins), the number of non-empty bins (non_bins), and the points within each hexagon (pts_bins).
hb_obj <- hex_binning(
nldr_obj = scurve_umap_obj,
b1 = 15,
q = 0.1)
Figure 2: The components of the hexagon grid illustrating notation.
If the hexagonal binning process is run separately, it involves several steps: (i) generating all possible centroids in a hexagonal grid, (ii) creating the coordinates of the hexagons, (iii) assigning data points to their respective hexagons, (iv) computing the standardized number of points within each hexagon, and (v) mapping the points to their corresponding hexagonal bins.
The gen_centroids() function calculates the centroids of a hexagonal grid.
The coordinate limits of the embedding (lim1 and lim2) are used to compute the aspect ratio between the two axes, which informs vertical spacing. The function then calls calc_bins_y(), a helper function that determines the appropriate number of hexagonal rows (bin2) and the width of each hexagon (a1) given the specified number of bins along the x-axis (b1) and buffer (q).
Then, the centroids are computed iteratively. The x-coordinates for centroids in odd-numbered rows are initialized as a sequence spaced by the hexagon width. Even-numbered rows are staggered by half this width to achieve a hexagonal tiling effect. Vertical spacing (a2) is derived by \(\sqrt{3}/2 \times a_1\).
The y-coordinates for each row are similarly calculated, and paired with the x-coordinates based on whether the total number of rows is even or odd. In the case of an odd number of rows, the final row uses only the odd-row x-coordinates to maintain the alternating pattern.
Finally, a tibble is returned containing a unique hexagon ID (h) along with the corresponding x and y centroid coordinates (c_x, c_y), which define the layout of the hexagonal grid over the \(2-\text{D}\) space.
all_centroids_df <- gen_centroids(
nldr_obj = scurve_umap_obj,
b1 = 15,
q = 0.1
)
all_centroids_df
> # A tibble: 240 × 3
> h c_x c_y
> <int> <dbl> <dbl>
> 1 1 -0.1 -0.0892
> 2 2 -0.0167 -0.0892
> 3 3 0.0665 -0.0892
> 4 4 0.150 -0.0892
> 5 5 0.233 -0.0892
> 6 6 0.316 -0.0892
> 7 7 0.400 -0.0892
> 8 8 0.483 -0.0892
> 9 9 0.566 -0.0892
> 10 10 0.649 -0.0892
> # ℹ 230 more rows
Following the generation of hexagonal centroids, the gen_hex_coord() function constructs the coordinates of each hexagonal bin by defining its six polygonal vertices. These coordinates are used to visualize the hexagonal tessellation.
Each hexagon is defined relative to its centroid \((C_x, C_y)\), with six vertices positioned equidistantly around the center. The function first verifies the presence of the required hexagon width parameter a1. This width determines the horizontal spacing.
Two derived constants are calculated to define the relative distances to the vertices. The horizontal and vertical offset is defined as \(dx = \frac{a_1}{2}\), and \(dy = \frac{a_1}{\sqrt{3}}\) repectively. A vertical spacing factor \(v = \frac{a_1}{2\sqrt{3}}\) refines vertical placement in staggered rows.
With these values, the function determines fixed offsets in the x and y directions for all six vertices relative to the centroid. These offsets form two vectors (x_add_factor and y_add_factor) corresponding to the six compass directions used to define the polygon shape: top, top-left, bottom-left, bottom, bottom-right, and top-right.
For each centroid, six vertices are computed and assigned a polygon ID (hex_poly_id) corresponding to the centroid’s h. These points are aggregated into a tibble that includes the polygon ID (hex_poly_id) and the respective x (x) and y (y) coordinates for all hexagon corners.
To reduce computational overhead, the geometry calculations are implemented in C++ using gen_hex_coord_cpp(), which returns a data.frame of vertex coordinates.
all_hex_coord <- gen_hex_coord(
centroids_data = all_centroids_df,
a1 = bin_configs$a1
)
head(all_hex_coord, 5)
> hex_poly_id x y
> 1 1 -0.10000000 -0.04116510
> 2 1 -0.14163068 -0.06520059
> 3 1 -0.14163068 -0.11327155
> 4 1 -0.10000000 -0.13730704
> 5 1 -0.05836932 -0.11327155
After generating the centroids that define the hexagonal grid, the next step is to assign each point in the NLDR embedding to its nearest hexagonal bin. The assign_data() function performs this assignment by calculating the \(2-\text{D}\) Euclidean distance between each point in the \(2-\text{D}\) embedding and all hexagon centroids.
First, the function extracts the first two dimensions of the scaled NLDR embedding, which represent the \(2-\text{D}\) layout. It then selects the corresponding x and y coordinates of each hexagon’s centroid.
Both the embedding coordinates and the centroid coordinates are converted to matrices to facilitate distance computations. The function uses the proxy::dist() method to compute a pairwise Euclidean distance matrix between all NLDR points and all centroids. For each NLDR point, the function identifies the index of the centroid with the smallest distance representing the closest hexagon—and assigns the corresponding hexagon ID (h) to the point.
The result is a data frame of the scaled \(2-\text{D}\) embedding with an additional h column, indicating the hexagonal bin to which each point belongs.
umap_hex_id <- assign_data(
nldr_obj = scurve_umap_obj,
centroids_data = all_centroids_df
)
head(umap_hex_id, 5)
> # A tibble: 5 × 4
> emb1 emb2 ID h
> <dbl> <dbl> <int> <int>
> 1 0.707 0.839 1 205
> 2 0.231 0.401 2 109
> 3 0.232 0.215 3 65
> 4 0.790 0.564 4 146
> 5 0.761 0.551 5 146
The compute_std_counts() function calculates both the raw and standardized counts for each hexagon.
The function begins by grouping the data by h and counting the number of NLDR points falling within each bin. These raw counts are stored as n_h. To enable comparisons across bins with varying densities, the function then standardizes these counts by dividing each bin’s count by the maximum count across all bins. This yields a normalized metric, w_h, ranging from 0 to 1.
std_df <- compute_std_counts(
scaled_nldr_h = umap_hex_id
)
head(std_df, 5)
> # A tibble: 5 × 3
> h n_h w_h
> <int> <int> <dbl>
> 1 21 12 0.0024
> 2 22 17 0.0034
> 3 23 22 0.0044
> 4 24 10 0.002
> 5 25 7 0.0014
The find_pts() function extracts the list of data point identifiers (ID) assigned to each hexagon in the NLDR space.
The function first groups the input data by h, which represents the hexagon label associated with each point in the \(2-\text{D}\) layout. Within each group, it collects the IDs into a list, resulting in a summary data frame where each row corresponds to a single hexagon. The resulting column, pts_list, contains all point identifiers associated with that hexagon.
pts_df <- find_pts(
scaled_nldr_hexid = umap_hex_id
)
head(pts_df, 5)
> # A tibble: 5 × 2
> h pts_list
> <int> <list>
> 1 21 <int [12]>
> 2 22 <int [17]>
> 3 23 <int [22]>
> 4 24 <int [10]>
> 5 25 <int [7]>
The extract_hexbin_centroids() function combines hexagonal bin coordinates, raw and standardized counts within each hexagons.
This function begins by arranging the counts_data by h to ensure consistent ordering. It then performs a full join with centroids_data, aligning hexagon IDs between the two datasets to incorporate both hexagonal bin centroids and count metrics. After merging, the function handles missing values in the count columns: any NA values in w_h or n_h are replaced with zeros. This ensures that hexagons with no assigned data points are retained in the output, with zero values for count-related fields. The resulting data frame contains the full set of hexagon centroids along with associated bin counts and standardized counts.
df_bin_centroids <- extract_hexbin_centroids(
centroids_data = all_centroids_df,
counts_data = hb_obj$std_cts
)
head(df_bin_centroids, 5)
> h c_x c_y n_h w_h
> 1 1 -0.10000000 -0.08923607 0 0
> 2 2 -0.01673864 -0.08923607 0 0
> 3 3 0.06652271 -0.08923607 0 0
> 4 4 0.14978407 -0.08923607 0 0
> 5 5 0.23304543 -0.08923607 0 0
To represent the neighborhood structure of hexagonal bins in a \(2-\text{D}\) layout, we employ Delaunay triangulation on the centroids of hexagons. This geometric approach is used to infer which bins are considered neighbors.
The tri_bin_centroids() function generates a triangulation object from the x and y coordinates of hexagon centroids using the tripack::tri.mesh() function. This triangulation forms the structural basis for identifying adjacent bins.
tr_object <- tri_bin_centroids(
centroids_data = df_bin_centroids
)
The gen_edges() function uses this triangulation object to extract line segments between neighboring bins. It constructs a unique set of bin-to-bin connections by identifying the triangle edges and filtering duplicate or reversed links. Each edge is then annotated with its start and end coordinates, and a Euclidean distance is computed using the helper function calc_2d_dist(). Only edges within a hexagon’s neighborhood radius (based on the hexagon side length a1) are retained.
trimesh <- gen_edges(tri_object = tr_object, a1 = hb_obj$a1)
head(trimesh, 5)
> # A tibble: 5 × 8
> from to x_from y_from x_to y_to from_count to_count
> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
> 1 1 2 -0.1 -0.0892 -0.0167 -0.0892 0 0
> 2 16 17 -0.0584 -0.0171 0.0249 -0.0171 0 0
> 3 16 32 -0.0584 -0.0171 -0.0167 0.0550 0 0
> 4 3 17 0.0665 -0.0892 0.0249 -0.0171 0 0
> 5 17 18 0.0249 -0.0171 0.108 -0.0171 0 0
The update_trimesh_index() function re-indexes the node IDs to ensure that edge identifiers are sequentially numbered and consistent with downstream analysis.
trimesh <- update_trimesh_index(trimesh_data = trimesh)
head(trimesh, 5)
> # A tibble: 5 × 8
> from to x_from y_from x_to y_to from_count to_count
> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
> 1 1 2 -0.1 -0.0892 -0.0167 -0.0892 0 0
> 2 16 17 -0.0584 -0.0171 0.0249 -0.0171 0 0
> 3 16 32 -0.0584 -0.0171 -0.0167 0.0550 0 0
> 4 3 17 0.0665 -0.0892 0.0249 -0.0171 0 0
> 5 17 18 0.0249 -0.0171 0.108 -0.0171 0 0
Not all hexagons contain meaningful information. Some may have very few or no data points due to the sparsity or shape of the underlying structure. Simply removing hexagons with low counts (e.g., fewer than a fixed threshold) can lead to gaps or “holes” in the \(2-\text{D}\) structure, potentially disrupting the continuity of the representation.
To address this, we propose a more nuanced method that evaluates each hexagon not only based on its own density, but also in the context of its immediate neighbors. The find_low_dens_hex() function identifies hexagonal bins with insufficient local support by calculating the average standardized count across their six neighboring bins. If this mean neighborhood density is below a user-defined threshold (e.g., 0.05), the hexagon is flagged for removal.
The find_low_dens_hex() function identifies hexagons with low point densities, considering the densities of their neighboring bins as well. The find_low_dens_hex() function relies on a helper, compute_mean_density_hex(), which iterates over all hexagons and computes the average density across neighbors based on their h and a defined number of bins along the x-axis (b1). The hexagonal layout assumes a fixed grid structure, so neighbor IDs are computed by positional offsets.
find_low_dens_hex(
centroids_data = df_bin_centroids,
b1 = 15,
benchmark_mean_dens = 0.05
)
> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
> [17] 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
> [33] 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
> [49] 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
> [65] 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
> [81] 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96
> [97] 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112
> [113] 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128
> [129] 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
> [145] 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160
> [161] 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176
> [177] 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192
> [193] 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208
> [209] 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224
> [225] 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240
For simplicity, we remove low-density hexagons using a threshold of \(10\).
The final step involves lifting the fitted \(2\text{-}D\) model into \(p\text{-}D\) by computing the \(p\text{-}D\) mean of data points within each hexagonal bin to represent bin centroids. This transformation is performed using the avg_highd_data() function, which takes \(p\text{-}D\) data (highd_data) and embedding data with their corresponding hexagonal bin IDs as inputs (scaled_nldr_hexid).
df_bin <- avg_highd_data(
highd_data = scurve,
scaled_nldr_hexid = hb_obj$data_hb_id
)
head(df_bin, 5)
> # A tibble: 5 × 8
> h x1 x2 x3 x4 x5 x6 x7
> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
> 1 21 -0.992 1.91 1.11 -0.000427 0.000624 0.00749 0.00105
> 2 22 -0.906 1.93 1.41 -0.0000183 0.00331 -0.0204 -0.000363
> 3 23 -0.680 1.93 1.72 -0.000810 -0.00259 -0.00449 0.00153
> 4 24 -0.272 1.93 1.96 0.00251 0.00668 -0.0460 0.00128
> 5 25 0.0760 1.93 2.00 0.00876 0.00447 0.00851 -0.00195
The predict_emb() function is used to predict \(2\text{-}D\) embedding for a new \(p\text{-}D\) data point using the fitted model. This function is useful to predict \(2\text{-}D\) embedding irrespective of the NLDR technique.
In the prediction process, first, the nearest \(p\text{-}D\) model point is identified for a given new \(p\text{-}D\) data point by computing \(p\text{-}D\) Euclidean distance. Then, the corresponding \(2\text{-}D\) bin centroid mapping for the identified \(p\text{-}D\) model point is determined. Finally, the coordinates of the identified \(2\text{-}D\) bin centroid is used as the predicted NLDR embedding for the new \(p\text{-}D\) data point.
To accelerate this process, the nearest-neighbor search is implemented in C++ using Rcpp via the internal function compute_highd_dist().
predict_data <- predict_emb(
highd_data = scurve,
model_2d = df_bin_centroids,
model_highd = df_bin
)
head(predict_data, 5)
> # A tibble: 5 × 4
> pred_emb_1 pred_emb_2 ID pred_hb_id
> <dbl> <dbl> <int> <int>
> 1 0.691 0.848 1 205
> 2 0.191 0.416 2 109
> 3 0.316 0.199 3 66
> 4 0.774 0.560 4 146
> 5 0.774 0.560 5 146
As a Goodness of fit statistics for the model, glance() is used to compute residuals and RMSE. These metrics are used to assess how well the fitted model will capture the underlying structure of the \(p\text{-}D\) data.
This function begins by renaming the columns of the input model_highd data frame to avoid naming conflicts during subsequent joins. It then uses the predict_emb() function to assign each point in the high-dimensional dataset to a corresponding bin in the 2D model, producing a prediction data frame that contains both the predicted bin assignment (pred_hb_id) and the original observation ID.
The function joins this prediction output with both the high-dimensional model (to get mean bin coordinates in the original space) and the original high-dimensional data (to retrieve true coordinates). It then calculates squared differences between the true and predicted high-dimensional coordinates for each dimension, storing these as error_square_x1, error_square_x2, …, up to the dimensionality of the data.
From these per-dimension errors, the function computes absolute error which is the sum of absolute differences across all dimensions and observations and the RMSE which is the average of the total squared error per point.
These metrics are returned in a tibble as Error (absolute error) and RMSE (root mean squared error). The computation of total absolute error and RMSE is performed in C++ for efficiency using the internal compute_errors() function.
glance(
highd_data = scurve,
model_2d = df_bin_centroids,
model_highd = df_bin
)
> # A tibble: 1 × 2
> Error RMSE
> <dbl> <dbl>
> 1 1481. 0.180
Furthermore, augment() accepts \(2\text{-}D\) and \(p\text{-}D\) model points, and the \(p\text{-}D\) data and adds information about each observation in the data set.
The function starts by renaming columns in the model_highd data frame to avoid conflicts. It then predicts the high-dimensional bin assignments for each point using the predict_emb() function, mapping each observation to its nearest \(2-\text{D}\) bin (pred_hb_id). This prediction is joined with the mean high-dimensional coordinates of each bin from the model and with the original high-dimensional data.
Next, the function computes residuals between each original coordinate (x1, x2, …, xp) and the corresponding modeled coordinate (model_high_d_x1, …, model_high_d_xp) across all dimensions. It calculates both squared errors and absolute errors per dimension. These are used to compute two aggregate diagnostic measures per point. First, the row_wise_total_error which is the total squared error across all dimensions, and the row_wise_abs_error which is the total absolute error across all dimensions.
The final output is a data frame that combines the original IDs, high-dimensional coordinates, predicted bin IDs, modeled coordinates, residuals, row wise total error, absolute error for the fitted values, and row wise total absolute error for each observation. The augmented dataset is always returned as a tibble::tibble with the same number of rows as the passed dataset.
model_error <- augment(
highd_data = scurve,
model_2d = df_bin_centroids,
model_highd = df_bin
)
head(model_error, 5)
> # A tibble: 5 × 32
> ID x1 x2 x3 x4 x5 x6 x7
> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
> 1 1 -0.120 1.64 -1.99 0.0104 0.0125 0.0923 -0.00128
> 2 2 -0.0492 1.51 0.00121 -0.0177 0.00726 -0.0362 -0.00535
> 3 3 -0.774 1.30 0.367 -0.00173 0.0156 -0.0962 0.00335
> 4 4 -0.606 0.246 -1.80 -0.00897 -0.0187 -0.0716 0.00126
> 5 5 -0.478 0.0177 -1.88 0.00848 0.00533 0.0998 0.000677
> # ℹ 24 more variables: pred_hb_id <int>, model_high_d_x1 <dbl>,
> # model_high_d_x2 <dbl>, model_high_d_x3 <dbl>,
> # model_high_d_x4 <dbl>, model_high_d_x5 <dbl>,
> # model_high_d_x6 <dbl>, model_high_d_x7 <dbl>,
> # error_square_x1 <dbl>, error_square_x2 <dbl>,
> # error_square_x3 <dbl>, error_square_x4 <dbl>,
> # error_square_x5 <dbl>, error_square_x6 <dbl>, …
The package offers several \(2-\text{D}\) visualizations, including:
The generated p-D model, overlaid with the data, can also be visualized using show_langevitour. Additionally, it features a function for visualizing the \(2-\text{D}\) projection of the fitted model overlaid on the data, called plot_proj.
Furthermore, there are two interactive plots, show_link_plots and show_error_link_plots, which are designed to help diagnose the model.
Each visualization can be generated using its respective function, as described in this section.
The geom_hexgrid() function introduces a custom ggplot2 layer designed for visualizing hexagonal grid on a provided set of bin centroids.
To display the complete grid, users should supply all available bin centroids.
full_hexgrid <- ggplot() +
geom_hexgrid(
data = hb_obj$centroids,
aes(x = c_x, y = c_y)
)
If the goal is to plot only the subset of hexagons that correspond to bins containing data points, then only the centroids associated with those bins should be passed.
data_hexgrid <- ggplot() +
geom_hexgrid(
data = df_bin_centroids,
aes(x = c_x, y = c_y)
)
The geom_trimesh() function introduces a custom ggplot2 layer designed for visualizing \(2\text{-}D\) wireframe on a provided set of bin centroids.
To display the complete wireframe, users should supply all available bin centroids.
full_triangulation_grid <- ggplot() +
geom_trimesh(
data = hb_obj$centroids,
aes(x = c_x, y = c_y)
)
If the goal is to plot only the subset of hexagons that correspond to bins containing data points, then only the centroids associated with those bins should be passed.
data_triangulation_grid <- ggplot() +
geom_trimesh(
data = df_bin_centroids,
aes(x = c_x, y = c_y)
)
Figure 3: The outputs of geom_hexgrid and geom_trimesh include: (a) a complete hexagonal grid, (b) a hexagonal grid that corresponds with the data, (c) a full grid based on centroid triangulation, and (d) a centroid triangulation grid that aligns with the data.
To evaluate how well the \(p\text{-}D\) model captures the underlying structure of the high-dimensional data, we provide a visualization using the show_langevitour() function. This function renders a dynamic projection of both the high-dimensional data and the model using the langevitour package.
Before plotting, the data needs to be organized into a combined format through the comb_data_model() function. This function takes three inputs: highd_data (the high-dimensional observations), model_highd (high-dimensional summaries for each bin), and model_2d (the hexagonal bin centroids of the model). It returns a tidy data frame combining both the data and the model.
In this structure, the type variable distinguishes between original observations ("data") and the bin-averaged model representation ("model").
df_exe <- comb_data_model(
highd_data = scurve,
model_highd = df_bin,
model_2d = df_bin_centroids
)
The show_langevitour() function then renders the visualization using the langevitour interface, displaying both types of points in a dynamic tour. The edge_data input defines connections between neighboring bins (i.e., the hexagonal edges) to visualize the model’s structure.
In the resulting interactive visualization black points represent the high-dimensional data, green points represent the model centroids from each bin, and the lines between model points reflect the \(2-\text{D}\) wireframe structure mapped back to high-dimensional space.
show_langevitour(
point_data = df_exe,
edge_data = trimesh
)
There are mainly two interactive link plots can be generated.
To support interactive evaluation of how well the \(p\text{-}D\) model captures the structure of the high-dimensional data, we introduce show_link_plots(). This visualization combines two complementary views: the nonlinear dimension reduction (NLDR) representation and a dynamic tour of the model ovelaid the data in the high-dimensional space. Both views are interactively linked, enabling users to explore.
Before visualization, the input data must be prepared using the comb_all_data_model() function. This function combines the high-dimensional data (highd_data), the NLDR data (nldr_data), and the bin-averaged high-dimensional model representation (model_highd) aligned to the \(2-\text{D}\) bin layout (model_2d):
This combined dataset includes both the original observations and the bin-level model averages, labeled with a type variable for distinguishing between them.
df_exe <- comb_all_data_model(
highd_data = scurve,
nldr_data = scurve_umap,
model_highd = df_bin,
model_2d = df_bin_centroids
)
The function show_link_plots() generates two side-by-side, interactively linked plots; a \(2-\text{D}\) NLDR representation, and a dynamic projection tour in the original high-dimensional space (using the langevitour package), displaying both the data and the model. The function takes the output from comb_all_data_model() (point_data) and edge_data which defines connections between neighboring bins.
These two views are linked using crosstalk, allowing interactive selection of points in the NLDR plot to highlight corresponding structures in the langevitour output.
show_link_plots(
point_data = df_exe,
edge_data = trimesh
)
show_error_link_plots() helps to see investigate whether the model fits the points everywhere or fits better in some places, or simply mismatches the pattern.
Before visualization, the input data must be prepared using the comb_all_data_model_error() function. The function requires several arguments: points data which contain high-dimensional data (highd_data), NLDR data (nldr_data), high-dimensional model data (model_highd), \(2-\text{D}\) model data (model_2d), and model error (error_data).
This combined dataset includes both the original observations and the bin-level model averages, labeled with a type variable for distinguishing between them.
df_exe <- comb_all_data_model_error(
highd_data = scurve,
nldr_data = scurve_umap,
model_highd = df_bin,
model_2d = df_bin_centroids,
error_data = model_error
)
The function show_error_link_plots() generates three side-by-side, interactively linked plots; a error distribution, a \(2-\text{D}\) NLDR representation, and a dynamic projection tour in the original high-dimensional space (using the langevitour package), displaying both the data and the model. The function takes the output from comb_all_data_model_error() (point_data) and edge_data which defines connections between neighboring bins.
These two views are linked using crosstalk, allowing interactive selection of points in the NLDR plot to highlight corresponding structures in the high-dimensional projection. This setup facilitates the diagnosis of local distortion, structural artifacts, and model fit quality.
These three views are linked using crosstalk, allowing interactive selection of points in error plot and the NLDR plot to highlight corresponding structures in the langevitour output.
show_error_link_plots(
point_data = df_exe,
edge_data = trimesh
)
Several core computations within quollr are optimized using compiled C++ code via the Rcpp and RcppArmadillo packages. While the user interacts with high-level R functions, performance-critical steps such as nearest-neighbor searches (compute_highd_dist()), error metrics (compute_errors()), \(2-\text{D}\) distance calculations (calc_2d_dist_cpp()), and generation of hexagon coordinates (gen_hex_coord_cpp()) are handled internally in C++. This design provides significant speedups when analyzing large datasets while maintaining a user-friendly R interface. These C++ functions are not exported but are bundled within the package and fully accessible for inspection in the source code.
Single-cell RNA sequencing (scRNA-seq) is a popular and powerful technology that allows you to profile the whole transcriptome of a large number of individual cells (Andrews et al. 2021).
Clustering of single-cell data is used to identify groups of cells with similar expression profiles. NLDR often used to summarise the discovered clusters, and help to understand the results. The purpose of this example is to illustrate how to use our method to help decide on an appropriate NLDR layout that accurately represents the data.
Limb muscle cells of mice in Consortium et al. (2018) are examined. There are \(1067\) single cells, with \(14997\) gene expressions. Following their pre-processing, different NLDR methods were performed using ten principal components. Figure 4 (b) is the reproduction of the published plot. The question is whether this accurately represents the cluster structure in the data. Our method help to provide a better \(2-\text{D}\) layout.
design <- gen_design(n_right = 6, ncol_right = 2)
plot_rmse_layouts(plots = list(error_plot_limb, nldr1,
nldr2, nldr3, nldr4,
nldr5, nldr6), design = design)
Figure 4: Assessing which of the 6 NLDR layouts on the limb muscle data is the better representation using RMSE for varying binwidth (\(a_1\)). Colour used for the lines and points in the left plot and in the scatterplots represents NLDR layout (a-f). Layout d is perform well at large binwidth (where the binwidth is not enough to capture the data struture) and poorly as bin width decreases. Layout f is the best choice.
Figure 6: Compare the published $2-\text{D}$ layout (Figure \@ref(fig:limb-rmse) b) and the $2-\text{D}$ layout selected (Figure \@ref(fig:limb-rmse) f) by RMSE plot (Figure \@ref(fig:limb-rmse)) from the tSNE, UMAP, PHATE, TriMAP, and PaCMAP with different (hyper)parameters. The Limb muscle data ($n = 1067$) has seven close different shaped clusters in $10\text{-}D$.
The quollr package introduces a new framework for interpreting NLDR outputs by fitting a geometric wireframe model in \(2-\text{D}\) and lifting it into high-dimensional space. This lifted model provides a direct way to assess how well a \(2-\text{D}\) layout, produced by methods such as tSNE, UMAP, PHATE, TriMAP, or PaCMAP, preserves the structure of the original high-dimensional data. The approach offers both numerical and visual diagnostics to support the selection of NLDR methods and tuning hyper-parameters that produce the most accurate \(2-\text{D}\) representations.
In contrast to the common practice of visually inspecting scatterplots for clusters or patterns, quollr provides a quantitative route for evaluation. It enables the computation of RMSE and residuals between the original high-dimensional data and the lifted model, offering interpretable diagnostics. These diagnostics are complemented by interactive linked plots and high-dimensional dynamic visualisations using the langevitour package, allowing users to inspect where the model fits well and where it does not.
To support efficient computation, particularly for large-scale datasets, several core functions in quollr are implemented in C++ using Rcpp and RcppArmadillo. These include functions for computing Euclidean distances in high-dimensional and \(2-\text{D}\) space, identifying nearest centroids, calculating residual errors, and generating polygonal coordinates of hexagons. For instance, compute_highd_dist() accelerates nearest neighbor lookup in high-dimensional space, compute_errors() calculates RMSE and total absolute error efficiently, and calc_2d_dist_cpp() speeds up distance calculations in \(2-\text{D}\). Additionally, gen_hex_coord_cpp() constructs the coordinates for hexagonal bins based on their centroids with minimal overhead. These optimizations result in substantial performance gains compared to native R implementations, making the package responsive even when used in interactive contexts or on large datasets such as single-cell transcriptomic profiles.
The modular structure of the package is designed to support both flexibility and reproducibility. Users can access individual functions to control each step of the pipeline such as scaling, binning, and triangulation or use the main function fit_highd_model() for end-to-end model construction. The diagnostics can be used not only to compare NLDR methods but also to tune binning parameters, assess layout stability, and detect local distortions in the embedding.
There are several avenues for future development. While hexagonal binning provides a regular structure conducive to modeling, alternative spatial discretizations (e.g., adaptive binning or density-aware tessellations) could be explored to better capture varying data densities. Expanding support for additional distance metrics in the lifting and prediction steps may improve performance across different domains. Additionally, statistical inference tools could be introduced to assess the stability and robustness of the fitted model, which would enhance interpretability and confidence in the outcomes.
The source code for reproducing this paper can be found at: https://github.com/JayaniLakshika/paper-quollr.
This article is created using knitr (Xie 2015) and rmarkdown (Xie et al. 2018) in R with the rjtools::rjournal_article template. These R packages were used for this work: cli (Csárdi 2025), dplyr (Wickham 2023), ggplot2 (Wickham 2016), interp (>= 1.1-6) (Gebhardt et al. 2024), langevitour (Harrison 2023), proxy(Meyer and Buchta 2022), stats (R Core Team 2025), tibble (Müller and Wickham 2023), tidyselect (Henry and Wickham 2024), crosstalk (Cheng and Sievert 2023), plotly (Sievert 2020), kableExtra (Zhu 2024), patchwork (Pedersen 2024), and readr (Wickham et al. 2024).
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Gamage, et al., "quollr: An R Package for Visualizing 2-D Models from Non-linear Dimension Reductions in High Dimensional Space", The R Journal, 2025
BibTeX citation
@article{paper-quollr,
author = {Gamage, Jayani P. and Cook, Dianne and Harrison, Paul and Lydeamore, Michael and Talagala, Thiyanga S.},
title = {quollr: An R Package for Visualizing 2-D Models from Non-linear Dimension Reductions in High Dimensional Space},
journal = {The R Journal},
year = {2025},
issn = {2073-4859},
pages = {1}
}